US Name

Data preparation: combine table information and journals

Description

None

Merge

Main table

papers_meta_analysis_new

Merged with:

Target

Metadata

Input

Table/file

Name

Github

Prepare query

Write the query and save the CSV back to the S3 bucket datalake-datascience

Steps

Merge journal and papers table

Missing journals

Currently, the missing values come from the rows to check in METADATA_TABLES_COLLECTION

Explanation of the missing values:

Date

location

peer_reviewed

adjusted_independent

Save data to Google Spreadsheet for sharing

Author information

During a presentation (Desir Seminar), it was pointed out that the characteristics of an author might impact the results. Most of this information is available on the internet.

We use two sources of information:

Using both data sources, we will retrieve or compute the following information:

The workflow works in three steps:

  1. Train the gender detection model using US names from the public dataset USA Names
  2. Download the paper and author information from the spreadsheet CSR Excel File Meta-Analysis - Version 4 - 01.02.2021
  3. Merge the paper and author information. The final table has the size: number of papers x number of authors per paper.

Train deep learning model gender detection

The first step of the workflow consists of training a basic LSTM architecture to determine the gender from the first name.

Training the model requires the following steps:

  1. Download the data from Google Cloud Platform
  2. Lowercase the first name, split it into characters, and convert them to numerical values
  3. Train the model using a vector embedding, a Bidirectional LSTM layer, and a dense layer to compute the prediction

Download data

The data comes from the public dataset USA Names

Clean up and pre-process

  1. Lowercase the name since each character’s case doesn’t convey any information about a person’s gender.
  2. Split each character: The basic idea of the ML model we’re building is to read characters in a name to identify patterns that could indicate masculinity or femininity. Thus we split the name into each character.
  3. Pad names with spaces to a maximum of 50 characters so the ML model sees the same length for all names.
  4. Encode each character to a unique number since ML models can only work with numbers. In this case, we encode ‘ ’ (space) to 0, ‘a’ to 1, ‘b’ to 2, and so on.
  5. Encode each gender to a unique number since ML models can only work with numbers. In this case, we encode ‘F’ to 0 and ‘M’ to 1.
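The five preprocessing steps above can be sketched as follows (function names and the exact vocabulary construction are illustrative, not the original notebook code):

```python
# Minimal sketch of the preprocessing: lowercase, split into characters,
# pad to 50, and encode characters and genders to integers.
MAX_LEN = 50

def encode_name(name: str) -> list:
    """Lowercase, truncate/pad to MAX_LEN, and encode each character."""
    name = name.lower()[:MAX_LEN]
    padded = name + " " * (MAX_LEN - len(name))
    # ' ' -> 0, 'a' -> 1, 'b' -> 2, ... as described in the text
    return [0 if c == " " else ord(c) - ord("a") + 1 for c in padded]

def encode_gender(gender: str) -> int:
    """'F' -> 0, 'M' -> 1."""
    return {"F": 0, "M": 1}[gender]
```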

For example, the preprocessing step does the following:

Take the name "mary": the character "m" is given the number 13, the character "a" the number 1, and so on. The 0s are padding, so that every row of the matrix has the same dimension.

image.png

Model Architecture

  1. Embedding layer: to “embed” each input character’s encoded number into a dense 256 dimension vector. The choice of embedding_dim is a hyperparameter that can be tuned to get the desired accuracy.
  2. Bidirectional LSTM layer: reads the sequence of character embeddings from the previous step and outputs a single vector representing that sequence. The values for units and dropout are hyperparameters as well.
  3. Final Dense layer: to output a single value close to 0 for ‘F’ or close to 1 for ‘M’ since this is the encoding we used in the preprocessing step.

Note: the Embedding layer enables us to convert each input into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0's and 1's. The fixed length helps us represent the inputs in a better way, with reduced dimensions.

The idea is to turn positive integers (indexes) into dense vectors of fixed size. This layer can then be used as the first layer in a model.

The size of the vocabulary (the list from the preprocessing) is 27: the alphabet has 26 letters, plus the space character. We want the output layer to be a vector of 256 weights. input_length is the maximum size of the name; we can set it because the length of the input sequences is constant.
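With these settings, a model matching the description can be sketched as below (this is not the original notebook code; the LSTM units and dropout values are illustrative hyperparameters):

```python
import tensorflow as tf

def build_model(vocab_size=27, embedding_dim=256, input_length=50):
    """Embedding -> Bidirectional LSTM -> Dense sigmoid output."""
    model = tf.keras.Sequential([
        # fixed-length integer-encoded names come in
        tf.keras.Input(shape=(input_length,)),
        # embed each of the 27 character codes into a 256-dim dense vector
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        # read the character embeddings in both directions
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, dropout=0.2)),
        # single output close to 0 for 'F', close to 1 for 'M'
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```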

Here is an example of what the vector embedding output looks like

Note: In this example we have not trained the embedding layer. The weights assigned to the word vectors are initialized randomly.

The embedding vector for the first word is:

Training the Model

We’ll use the standard tensorflow.keras training pipeline:

  1. Instantiate the model using the function we wrote in the model architecture step.
  2. Split the data into 80% training and 20% validation.
  3. Call model.fit with EarlyStopping callback to stop training once the model starts to overfit.
  4. Save the trained model
  5. Plot the training and validation accuracies to visually check the model performance.
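A sketch of that pipeline, assuming a compiled binary-classification model from the previous step (the callback settings and output file name are illustrative):

```python
import numpy as np
import tensorflow as tf

def train(model, X, y, epochs=50):
    """Fit with an 80/20 split and early stopping, then save the model."""
    callback = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3,
        restore_best_weights=True)  # stop once the model starts to overfit
    history = model.fit(
        X, y,
        validation_split=0.2,  # 80% training / 20% validation
        epochs=epochs,
        callbacks=[callback])
    model.save("gender_model.h5")  # illustrative file name
    return history
```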

Download authors and paper information

To get the most information possible about an author, we need to rely on two different data sources:

Both data sources are accessible through an API.

However, Google Scholar does not always return the author information, and the spelling differs between the two data sources.

In this step, we begin by fetching data from Semantic Scholar, then Google Scholar.

Semantic scholar

Our primary objective is to get information about the gender, but also to evaluate an author's expertise in ESG. The data source Semantic Scholar has 198,182,311 papers from all fields of science.

Our strategy is to use the API to search for a paper in order to get the related information (DOI, citations, performance) and, more importantly, the ID of the author(s). Indeed, to get information about an author, we need to know their ID. As soon as we have the ID, we can collect and compute all the other information (i.e. gender and expertise).

The workflow is the following:

  1. Find the paper's ID in Semantic scholar from the spreadsheet CSR Excel File Meta-Analysis - Version 4 - 01.02.2021
  2. Fetch paper information using the ID, including the list of authors
  3. Fetch author information using the author ID (including all the author's publications)
  4. Use the author name and aliases to predict the gender
  5. Save the results in S3: DATA/JOURNALS/SEMANTIC_SCHOLAR/PAPERS
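A minimal sketch of steps 1–3 against the public Semantic Scholar Graph API (the `/graph/v1` endpoint and field names below reflect the current public API and are an assumption about what the original notebook used):

```python
import requests

BASE = "https://api.semanticscholar.org/graph/v1"

def find_paper_id(title):
    """Step 1: search for a paper by title and return its Semantic Scholar ID."""
    r = requests.get(f"{BASE}/paper/search",
                     params={"query": title, "fields": "title,authors"})
    r.raise_for_status()
    data = r.json().get("data", [])
    return data[0]["paperId"] if data else None

def fetch_author(author_id):
    """Step 3: fetch an author's name, aliases, and publication titles."""
    r = requests.get(f"{BASE}/author/{author_id}",
                     params={"fields": "name,aliases,papers.title"})
    r.raise_for_status()
    return r.json()
```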

Get paper name

Below is an example with the following paper:

image.png

the authors are:

and the DOI is "10.1002/(SICI)1097-0266(199704)18:4<303::AID-SMJ869>3.0.CO;2-G"

In the next steps, we want to predict the gender of the author. The first author is S. Waddock, for whom the gender is impossible to detect because only one letter is displayed for the first name. Therefore, we will combine the first name with all the aliases. We add another constraint: the first name must have more than 2 characters:

Then we push all the candidates to the model and return the average probability. The model gives an average probability of 43%; since this is below the 0.5 threshold, the author is classified as female.
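The alias-averaging rule can be sketched as below, assuming the trained model and the name preprocessing from the earlier steps (function and parameter names here are illustrative):

```python
import numpy as np

def predict_gender(model, name, aliases, encode_name, max_len=50):
    """Average the model's probability over all usable first-name candidates."""
    # candidate first names: the displayed name plus every alias
    first_names = [a.split()[0] for a in [name] + aliases]
    # keep only first names with more than 2 characters (so "S." is dropped)
    candidates = [f for f in first_names if len(f) > 2]
    if not candidates:
        return None
    X = np.array([encode_name(c) for c in candidates])
    mean_p = model.predict(X).ravel().mean()
    # below 0.5 -> 'F' (encoded 0), at or above -> 'M' (encoded 1)
    return "M" if mean_p >= 0.5 else "F"
```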

Get full list of information

Failure:

Google scholar

The strategy with Google Scholar is the same as Semantic scholar:

  1. Use the list of paper's DOI from Semantic scholar to find the paper information and author ID
  2. Use the author ID to collect information
  3. Predict gender

Reconstruct paper-authors tables

The final step consists of bringing together the paper information and the author information. In the end, we want a table with all the author information for a given paper. Therefore, the final table has the following dimensions: number of papers x number of authors.

To construct the table, we need to proceed in three steps:

  1. Combine the author information from Semantic scholar and Google Scholar, and construct the author expertise
  2. Combine the paper information from Semantic scholar and Google Scholar
  3. Merge steps 1 and 2

The data is saved in Google spreadsheet for validation purposes: AUTHOR_SEMANTIC_GOOGLE

Author information from Semantic scholar and Google Scholar

We mentioned earlier that there is no direct link between the author names in Semantic scholar and Google scholar, for two reasons. First, the spelling differs. Second, not all authors have information in Google scholar.

We also need to construct the authors' expertise in ESG.

In this part, we will reconstruct the list of author information from Semantic scholar and Google Scholar and construct the expertise, following these steps:

  1. Import list of authors from Google Scholar
    • construct list of interest
    • grab the email extension
  2. Identify the same authors in Semantic scholar and Google scholar. There are 266 authors in Semantic scholar and 146 in Google scholar
    • Use Levenshtein/Hamming distance and a similarity algorithm to find similar authors

Import list of authors from Google Scholar

Identify same authors in Semantic scholar and Google scholar

To identify the same authors in the list of authors from Semantic scholar and Google scholar, we use the following sets of rules:

  1. Construct the Cartesian product between the list of authors from Semantic scholar and Google scholar (38,836 rows)
    • image.png
  2. Compute the Levenshtein and Hamming distances
    • Within each Google author, get the lowest Levenshtein and the max Hamming
  3. Filter rows where the lowest Levenshtein score equals the Levenshtein distance
  4. Compute the similarity algorithm
  5. Search for the best match: 124 exact matches
    • Compute the number of candidates within Google authors
    • Filter when the number of candidates equals 1 (one choice only) and the similarity score is above .1
    • Filter when the max similarity with the Google author equals the similarity
  6. Search for the second best match: 7 exact matches
    • Exclude authors from Semantic scholar
    • Exclude rows with a similarity equal to 0
    • Compute the minimum Hamming distance within Google authors
    • Keep rows where the minimum Hamming distance equals the Hamming distance
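The distance computations in steps 2–4 can be sketched as follows; the hand-rolled Levenshtein and Hamming functions are standard, and difflib's ratio stands in for the external similarity API (an assumption, not the original scorer):

```python
from difflib import SequenceMatcher

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def hamming(a: str, b: str) -> int:
    """Count differing positions, padding the shorter name with spaces."""
    n = max(len(a), len(b))
    return sum(x != y for x, y in zip(a.ljust(n), b.ljust(n)))

def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 is a perfect match."""
    return SequenceMatcher(None, a, b).ratio()
```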

Perfect match

image.png

Imperfect or dissimilar match

image.png

We use the similarity API to compute the similarity score

Filter 1: Potential candidates

Filter 1: 124 exact matches

Test 2: 7 exact matches

We found 131 authors among the 266:

Authors ESG expertise

Semantic scholar provides the list of all papers for any author. In the previous steps, we saved this list.

For instance, the author Abderrahmane Jahmane wrote 3 papers in his career:

[{'paperId': '44af7948d66a4dc62952a863e957faaa5770d13c',
  'title': 'Corporate social responsibility and firm value: Guiding through economic policy uncertainty'},
 {'paperId': '57bf8e616da8230ca7a961be19affeb8b8ae619d',
  'title': 'Corporate social responsibility, financial instability and corporate financial performance: Linear, non-linear and spillover effects – The case of the CAC 40 companies'},
 {'paperId': 'eff6f21cc09c572f3bdc8add0d0f43badecbf977',
  'title': 'Accounting for endogeneity and the dynamics of corporate social – Corporate financial performance relationship'}]

image.png

Source: https://www.semanticscholar.org/author/Abderrahmane-Jahmane/122677227

We will use the paper's title to flag whether it deals with ESG or not.

For the 266 authors, we have collected about 14,443 unique papers. We rely on a naive technique to flag all 14,443 papers.

The technique is the following:

image.png

In total, 857 papers deal with ESG among the 14,443 papers (5.9%).
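As an illustration only (the actual technique is shown in the figure above, and the keyword list here is an assumption, not the original one), a naive title-based flag could look like:

```python
# Hypothetical keyword list; the original flagging rule may differ.
ESG_KEYWORDS = {"csr", "corporate social responsibility", "esg",
                "environmental", "sustainability", "governance"}

def is_esg(title: str) -> bool:
    """Flag a paper as ESG-related if its title contains any keyword."""
    t = title.lower()
    return any(k in t for k in ESG_KEYWORDS)
```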

The last step to create the authors table concatenates the common authors from Semantic scholar and Google scholar with the authors not in Google scholar. Once the concatenation is done, we compute the expertise score as follows, within author:
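The exact score is not reproduced here; one plausible definition, used purely as an illustration (an assumption, not the original formula), is the share of an author's collected papers flagged as ESG:

```python
def expertise_score(n_esg_papers: int, n_total_papers: int) -> float:
    """Hypothetical expertise measure: fraction of an author's papers on ESG."""
    return n_esg_papers / n_total_papers if n_total_papers else 0.0
```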

The distribution of expertise:

image.png

The distribution of gender:

image.png

Add paper information

Table meta_analysis_esg_cfp

Since the table to create has missing values, please use the following at the top of the query:

CREATE TABLE database.table_name WITH (format = 'PARQUET') AS

Choose a location in S3 to save the CSV. It is recommended to save it in the datalake-datascience bucket. Locate an appropriate folder in the bucket, and make sure all outputs have the same format.

First, we need to delete the table (if it exists).

Clean up the folder containing the previous CSV file. Be careful: this will erase all files inside the folder.
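A sketch of these two housekeeping steps with boto3 (the client calls follow the standard boto3 API; the database, bucket, and prefix names are placeholders, not the original notebook's):

```python
def drop_table_query(database: str, table: str) -> str:
    """Athena statement to drop the table if it exists."""
    return f"DROP TABLE IF EXISTS {database}.{table}"

def clean_previous_output(database, table, bucket, prefix, output_location):
    import boto3  # imported here so the query helper stays dependency-free
    # drop the Athena table if it exists
    athena = boto3.client("athena", region_name="eu-west-3")
    athena.start_query_execution(
        QueryString=drop_table_query(database, table),
        ResultConfiguration={"OutputLocation": output_location})
    # WARNING: this deletes every object under the prefix
    s3 = boto3.resource("s3", region_name="eu-west-3")
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```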

Use Semantic scholar to find ID

Add authors information:

Update Glue catalogue and Github

This step is mandatory to validate the query in the ETL.

Create or update the data catalog

The query is saved in S3 (bucket datalake-london), but the comments are not available. Use the functions below to update the catalogue and Github.

Update the dictionary

Update schema

If automatic = False in automatic_update, the function returns only the variables whose comments need updating. Manually add the comments, then pass the new schema (only the missing comments) to the argument new_schema.

To update the schema, please use the following structure

schema = [
    {
        "Name": "VAR1",
        "Type": "",
        "Comment": ""
    },
    {
        "Name": "VAR2",
        "Type": "",
        "Comment": ""
    }
]

The function below manages everything automatically. If the final table comes from more than one query, pass a list of tables to list_tables instead of using automatic.

Check Duplicates

One of the most important steps when creating a table is to check whether it contains duplicates. The cell below checks that the table generated above is free of duplicates. The code uses the JSON file to create the query parsed in Athena.

You are required to define the group(s) that Athena will use to compute the duplicates. For instance, if your table can be grouped by COL1 and COL2 (they need to be string or varchar), then pass the list ['COL1', 'COL2'].
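The duplicate check amounts to a GROUP BY/HAVING query over those columns; a sketch of the query construction (the helper name is illustrative, not the original code):

```python
def duplicate_query(table: str, group_cols: list) -> str:
    """Build an Athena query returning every group that appears more than once."""
    cols = ", ".join(group_cols)
    return (f"SELECT {cols}, COUNT(*) AS n "
            f"FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1")
```

An empty result set means the table is free of duplicates for that grouping.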

Count missing values

Update Github Data catalog

The data catalog is available in Glue. However, we might want quick access to the tables from Github. In this part, we generate a README.md in the folder 00_data_catalogue. All tables used in the project are added to the catalog. We use the ETL parameter file and the schema in Glue to create the README.

Bear in mind the code will erase the previous README.

Analytics

In this part, we provide basic summary statistics. Since we have created the tables, we can parse the schema in Glue and use our JSON file to automatically generate the analysis.

The cells below execute the job in the key ANALYSIS. You need to change the primary_key and secondary_key

For a full analysis of the table, please use the following Lambda function. Be patient: it can take between 5 and 30 minutes, depending on the number of columns in your dataset.

Use the function as follows:

Check the job processing in Sagemaker: https://eu-west-3.console.aws.amazon.com/sagemaker/home?region=eu-west-3#/processing-jobs

The notebook is available at: https://s3.console.aws.amazon.com/s3/buckets/datalake-datascience?region=eu-west-3&prefix=ANALYTICS/OUTPUT/&showversions=false

Please download the notebook to your local machine and convert it to HTML:

cd "/Users/thomas/Downloads/Notebook"
aws s3 cp s3://datalake-datascience/ANALYTICS/OUTPUT/asif_unzip_data_csv/Template_analysis_from_lambda-2020-11-22-08-12-20.ipynb .

## convert to HTML without code cells
jupyter nbconvert --no-input --to html Template_analysis_from_lambda-2020-11-21-14-30-45.ipynb
jupyter nbconvert --to html Template_analysis_from_lambda-2020-11-22-08-12-20.ipynb

Then upload the HTML to: https://s3.console.aws.amazon.com/s3/buckets/datalake-datascience?region=eu-west-3&prefix=ANALYTICS/HTML_OUTPUT/

Add a new folder with the table name in upper case

Report generation